Real-world images are notoriously complex and variable, owing both to photometric changes (such as varying illumination and shadows) and to geometric changes (such as viewpoint variation and occlusion). Nevertheless, humans readily perceive the semantic meaning of an image: what objects are present, what the scene environment is, and what activities are taking place. This motivates the problem of semantic image understanding, i.e. developing computer vision algorithms that effectively extract useful meaning from vast amounts of visual information. Such research has applications in inferring social interactions from seamlessly shared photos; automatic indexing, retrieval, and organization of multimedia libraries; educational and clinical assistive technology; and security systems.

In this talk, I will present my work on using machine learning techniques to tackle the challenging problem of large-scale, real-world semantic image understanding. I will first give a brief review of my research on modeling real-world images for a number of fundamental recognition tasks: object classification and detection, scene classification, image annotation, segmentation, and automatic image hierarchy construction. My algorithms have been extensively tested both on large-scale real-world data, such as images available on the Internet, and in a competition designed to simulate a real-world robot perception scenario.

I will then introduce a fundamentally new image representation for semantic understanding of images. This representation carries rich semantic and spatial information about the objects within an image. I show that it is powerful on high-level visual recognition tasks, significantly outperforming low-level image representations and state-of-the-art approaches on several benchmark datasets.